Studying different concepts by frequently alternating between them (i.e., interleaving)
improves discriminative contrast between different categories, while studying each
concept in separate blocks emphasizes the similarities within each category. Interleaved study
has been shown to improve learning of high similarity categories by increasing between- 
category comparison, while blocked study improves learning of low similarity categories 
by increasing within-category comparison. In addition, interleaved study presents greater 
temporal spacing between repetitions of each category compared to blocked study, which 
might present long-term memory benefits. In this study we asked if the benefits of 
temporal spacing would interact with the benefits of sequencing for making comparisons 
when testing was delayed, particularly for low similarity categories. Blocked study might 
be predicted to promote noticing similarities across members of the same category and 
result in short-term benefits. However, the increase in temporal delay between repetitions 
inherent to interleaved study might benefit both types of categories when tested after a 
longer retention interval. Participants studied categories either interleaved or blocked and 
were tested immediately and 24 h after study. We found an interaction between schedule 
of study and the type of category studied, which is consistent with the differential emphasis 
promoted by each sequential schedule. However, increasing the retention interval did not
modulate this interaction or result in improved performance for interleaved study. Overall,
this indicates that the benefit of interleaving is not primarily due to temporal spacing during
study, but rather due to the cross-category comparisons that interleaving facilitates. We
discuss the benefits of temporal spacing of repetitions in the context of sequential study 
and how it can be integrated with the attentional bias hypothesis proposed by Carvalho 
and Goldstone (2014a). 
INTRODUCTION

Much of our knowledge is acquired inductively. By studying sev- 
eral examples of a given concept we are able to extract the relevant 
information from those examples and generalize the concept they 
instantiate. For example, upon seeing several instances of typical 
birds one might infer that all birds have beaks and feathers. In 
the context of inductive learning, the way information is orga- 
nized can have a deep impact on what is learned. If learning is not 
equally efficient under different conditions, even though the same 
information is presented, it becomes particularly relevant to iden- 
tify not only how different conditions affect learning but also how 
learning can be optimized (Atkinson, 1972; Pavlik and Anderson, 
2008). 

Given the potentially large influence of example sequenc- 
ing, most category learning studies employ a neutral, randomly 
ordered presentation of exemplars and categories when induc- 
tively teaching categories. However, outside the lab, information is 
not usually sequenced randomly. For example, a typical textbook 
for "Introduction to Statistics" will start with coverage of descrip- 
tive statistics, followed by probability theory and then hypothesis 
testing, i.e., concepts are blocked. An alternative to the blocked 
study sequence described above is interleaving different concepts. 



In interleaved study, different concepts are successively alternated. 
Put concretely, two possible ways to learn the concepts A, B, and C
from examples are by blocking the examples of each concept (e.g.,
A1 A2 A3 B1 B2 B3 C1 C2 C3), or by interleaving examples of all the
concepts (e.g., A1 B1 C1 A2 B2 C2 A3 B3 C3). Importantly, these
two schedules of presentation provide different study experiences,
which has the potential to change what we learn (e.g., Goldstone,
1996; Schyns and Rodet, 1997), and how well we learn it (e.g.,
Kornell and Bjork, 2008; Wahlheim et al., 2011).
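
The two orderings can be made concrete with a short sketch. The function names below are illustrative only and are not part of any published materials; the sketch simply reproduces the A, B, C example and counts how often the concept changes from one trial to the next.

```python
def blocked_order(concepts, n_examples):
    """All examples of one concept, then all of the next: A1 A2 A3 B1 ..."""
    return [f"{c}{i}" for c in concepts for i in range(1, n_examples + 1)]


def interleaved_order(concepts, n_examples):
    """Cycle through the concepts on every trial: A1 B1 C1 A2 ..."""
    return [f"{c}{i}" for i in range(1, n_examples + 1) for c in concepts]


def category_switches(sequence):
    """Count trial-to-trial transitions where the concept letter changes."""
    return sum(a[0] != b[0] for a, b in zip(sequence, sequence[1:]))


blocked = blocked_order("ABC", 3)
interleaved = interleaved_order("ABC", 3)
print(" ".join(blocked))      # A1 A2 A3 B1 B2 B3 C1 C2 C3
print(" ".join(interleaved))  # A1 B1 C1 A2 B2 C2 A3 B3 C3
print(category_switches(blocked), category_switches(interleaved))  # 2 8
```

With nine trials there are eight transitions: the blocked order changes concept on two of them, the interleaved order on all eight.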

Research in skill acquisition has demonstrated a clear advantage
for interleaved study. For example, Shea and Morgan (1979)
had participants learn three different sequences of complex move- 
ments in an apparatus where each sequence was prompted by the 
presentation of a different light color. All participants practiced 
each sequence 18 times. Critically, for half of the participants the 
practice of each of the three different sequences was interleaved 
while for the other half it was blocked by light color. The results 
showed that during study participants in the blocked condition 
performed better than those in the interleaved condition. How- 
ever, this pattern was reversed in a delayed transfer task. These 
results have been extended to other types of learning, namely concept
learning using artist styles (Kornell and Bjork, 2008; Kornell
et al., 2010; Kang and Pashler, 2012; Zulkiply and Burt, 2013),
butterfly and bird species (Wahlheim et al., 2011; Birnbaum et al.,
2013; Zulkiply and Burt, 2013), mathematical and clinical concepts
(Rohrer and Taylor, 2007; Taylor and Rohrer, 2010; Zulkiply
et al., 2012) as well as novel categories (Zulkiply and Burt, 2013;
Carvalho and Goldstone, 2014a,b).

Although a diverse set of concepts has been used to show 
a learning advantage for interleaved study, a common char- 
acteristic is that the items from the to-be-learned categories 
have a high degree of similarity and are, therefore, hard to 
discriminate or encode individually (Zulkiply and Burt, 2013; 
Carvalho and Goldstone, 2014a). Hence, presenting items from 
different categories close in sequence optimizes discriminative 
contrast leading to better learning (Kang and Pashler, 2012; 
Birnbaum et al., 2013; Zulkiply and Burt, 2013; Carvalho and
Goldstone, 2014a). Conversely, when each item is significantly 
different from items in the same and different categories, i.e., 
when low similarity categories are used, research has shown that 
blocked study results in improved learning (Kurtz and Hovland, 
1956; Whitman and Garner, 1963; Goldstone, 1996; Carpenter 
and Mueller, 2013; Zulkiply and Burt, 2013; Carvalho and Gold- 
stone, 2014a). In the case of learning low similarity categories, 
the difficulty is not primarily in discriminating items from differ- 
ent categories but rather finding similarities within the categories, 
which is optimized by often repeating the same category close in 
time (Carvalho and Goldstone, 2014a,b). 

It is therefore possible that category learning depends upon the 
match between the study sequence and the type of category being 
studied (Zulkiply and Burt, 2013; Carvalho and Goldstone, 2014a) 
or the learning situation (Carvalho and Goldstone, 2014b; Rawson 
et al., 2014). However, blocked and interleaved study do not differ 
only on the type of contrast they emphasize. They also differ in the 
amount of temporal spacing between successive repetitions of the 
same category. Interleaving study maximizes the temporal spacing 
between repetitions, while blocking study minimizes the temporal 
spacing between repetitions. 

Increasing the temporal spacing between verbatim repetitions 
during study confers significant mnemonic benefits (Glenberg, 
1976; Glenberg and Lehmann, 1980; Pashler et al., 2007; Cepeda
et al., 2009; Delaney et al., 2010) and it has been proposed that
interleaved study benefits for learning are at least in part due to the
temporal spacing between repetitions of the same category (Shea
and Morgan, 1979; Lee and Magill, 1985). Interleaved study does 
not involve temporally spaced token repetitions but rather tem- 
porally spaced repetitions of the same category type. When study 
is blocked by category, even though different specific items might 
be presented on successive study trials, the same category response 
is activated for all of them. On the contrary, interleaved study of 
several categories requires alternating category assignments more 
frequently. An increased temporal spacing between repetitions of 
a category increases forgetting of the previous encounter with that 
category and increases the effort in recalling previous encounters
(Bjork and Allen, 1970; Cuddy and Jacoby, 1982; Krug et al.,
1990). The increased effort to recall the previous encounter typically
results in better long-term retention of the repeated elements
across different items of the same category because they were
present in both encounters (Vlach et al., 2008, 2012, 2014).



In the case of verbatim repetitions of items, the benefits of spac- 
ing are sometimes not seen when the test takes place shortly after 
learning but are seen at longer retention intervals between study 
and test (e.g., Peterson et al., 1962a,b; Glenberg and Lehmann,
1980; Bloom and Shuell, 1981; Krug et al., 1990; Rohrer and
Taylor, 2006). Thus, the optimal temporal spacing between repe- 
titions depends on the interval between the last study repetition 
and test (i.e., the retention interval). Initial proposals suggested 
that increasing the temporal spacing between repetitions improves 
memory if the retention interval is long (Crowder, 1976) or pro- 
portionally longer than the temporal spacing between repetitions 
during study (Murray, 1983). Recent reviews of the literature indi- 
cate that the benefits of increasing the temporal spacing during 
study depend on the length of the retention interval (Donovan 
and Radosevich, 1999; Janiszewski et al., 2003; Cepeda et al., 2006).
Cepeda et al. (2008) compared a set of temporal lags during study 
in the context of different retention intervals and noted that when 
retention interval increases the optimal temporal spacing during 
study increases as well. Similar evidence of an effect of reten- 
tion interval length exists in the case of non-verbatim repetitions 
(Ste-Marie et al., 2004).

While much research has addressed the benefits of temporally
spacing repetitions in word list or paired associates learning
tasks, little research has examined how retention interval and
spacing interact in category learning. An important question,
therefore, is whether the previously found benefits of blocked 
study for low similarity categories are only evident at short 
retention intervals while interleaved study promotes long-term 
retention. One possible conceptualization of why this might be 
the case is as follows (see Table 1 for an overview of these 
predictions). 

When discriminating individual items is easy (as in the case 
of low similarity categories) and the test is immediate, learn- 
ers might be able to memorize individual items during study 
and use that memory to categorize novel items during an 
immediate test (Ashby and O'Brien, 2005). This strategy might 
provide immediate benefits, similar to what is seen in verba- 
tim massed repetitions of items, but provide a transient memory 
trace that will result in decreased long-term memory (Glenberg, 
1976; Glenberg and Lehmann, 1980). Under this conceptualization,
spacing repetitions of low similarity categories (i.e., interleaved
study) will result in improved long-term retention of each
individual item by increasing the temporal spacing between repetitions.
With increasing temporal delays learners might engage in a
recursive recollection process (Murray, 1983; Ross, 1984; Ross and
Kennedy, 1990; Benjamin and Tullis, 2010; Wahlheim et al., 2014).
Every time an item of a category is presented, learners will try to
remember the previous items from the same category seen and
this recursive retrieval is likely to result in learning benefits (Vlach
et al., 2008, 2012, 2014; Birnbaum et al., 2013). When discriminating
items is hard, as in the case of high similarity categories,
memorizing individual items is less likely and learners will resort
to encoding only the relevant features of each category by contrasting
them. Interleaved study of categories optimizes attending to
and encoding these features (Kang and Pashler, 2012; Birnbaum
et al., 2013; Carvalho and Goldstone, 2014a). However, increasing
the temporal spacing between each category by, for example,
including another task between interleaved presentations of different
categories, hinders noticing these differences (Kang and
Pashler, 2012; Birnbaum et al., 2013). We will return to the differences
between exemplar and rule encoding and their potential
importance for understanding sequencing effects in the Section
"General Discussion."

Table 1 | Predictions for each study schedule and category structure for test at different retention intervals.

| Study schedule | Attentional bias (a) | Type of category | Test outcome without delay (b) | Test outcome with delay |
|---|---|---|---|---|
| Blocked | Within-category similarities | Low similarity | Improved learning | Worse learning |
| | | High similarity | Worse learning | Worse learning |
| Interleaved | Between-category differences | Low similarity | Worse learning | Improved learning |
| | | High similarity | Improved learning | Improved learning |

The kind of study schedule, attentional bias and categories to be learned are described in the first three columns while the last two columns present predictions for
learning and retention performance.

(b) Predictions based on previous work showing an interaction between category structure and schedule of study (Zulkiply and Burt, 2013; Carvalho and Goldstone,
2014a).

In this paper we investigate the relative benefits of category
comparison and temporal delay during study and their interaction
with retention interval. We approach this question by teaching
learners two different types of categories: high similarity categories 
in which all the stimuli are very similar to each other (both within 
and between the three categories to be learned), and low simi- 
larity categories in which any pair of stimuli share relatively few 
similarities. Additionally, learners' categorization ability for the 
items studied and new transfer items was tested both immediately 
and 24 h after the initial study. To foreshadow, the differential 
attentional biases promoted by each schedule (interleaving and 
blocking) will confer differential relative benefits to different types 
of categories. Moreover, the benefit of the temporal delay between 
repetitions during study will benefit category learning following 
interleaved study at increased retention intervals for both category 
structures. 

EXPERIMENT 1A 
METHOD 

Participants 

A total of 178 undergraduate students at Indiana University volunteered
to participate in this study in return for partial course credit.
Participants were randomly assigned to either the high similarity
(N = 94) or low similarity (N = 84) condition. Data from a total
of 65 participants were excluded from analyses due to failure to
complete the second session (N = 16 for the high similarity condition
and N = 18 for the low similarity condition), computer
error (N = 3 for the high similarity condition and N = 1 for the 
low similarity condition), or failure to reach the criterion of 34% 
correct responses across the four blocks of the initial study phase 
(N = 24 for the high similarity condition and N = 3 for the low 
similarity condition). The higher rate of failure to reach criterion 
for the high compared to low similarity condition replicates pre- 
vious studies (Carvalho and Goldstone, 2014a), and is intuitively 
plausible from an inspection of Figure 1 and the highly confusable 
nature of the high similarity stimuli. 
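
As a quick consistency check on the counts above, the exclusions can be tallied per condition. The dictionary keys are labels chosen here for illustration; the numbers are the ones reported in this section.

```python
# Participants assigned to each between-participants condition.
assigned = {"high": 94, "low": 84}

# Exclusions by reason, as reported in the text.
excluded = {
    "high": {"no_second_session": 16, "computer_error": 3, "below_criterion": 24},
    "low": {"no_second_session": 18, "computer_error": 1, "below_criterion": 3},
}

total_assigned = sum(assigned.values())
total_excluded = sum(sum(reasons.values()) for reasons in excluded.values())
retained = {cond: assigned[cond] - sum(excluded[cond].values()) for cond in assigned}

print(total_assigned, total_excluded)  # 178 65
print(retained)                        # {'high': 51, 'low': 62}
print(sum(retained.values()))          # 113
```

The 113 retained participants are consistent with the 112 degrees of freedom reported later for the comparison between the last study block and the refresher.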



Apparatus and stimuli 

The stimuli used were blob figures (see Figure 1). These stimuli 
were previously used by Carvalho and Goldstone (2014a). All blobs
were created by randomly generating curvilinear segments. A sin- 
gle curvilinear segment defined each category and was present in 
all exemplars of that category. Across all of our experiments, two 
sets of six categories were used (three categories studied blocked 
and three studied interleaved, randomly selected for each partici- 
pant), a low-similarity set and a high-similarity set, for a total of 
12 categories. Each category was composed of 16 exemplars. 

In the high-similarity set, exemplars shared most of their fea- 
tures with all of the other exemplars in the same category and in 
each of the other five categories. Moreover, variation within each 
category was exactly the same for all categories, so that a difference 
that could exist between two exemplars in category 1 would also
exist between two exemplars of each of the other categories in
the set. In the low-similarity set, exemplars within each category
shared only the category-relevant feature. Moreover, exemplars
from different categories differed in all of their features. Some of
the exemplars had an overall round shape, and others an overall
oblique shape (this variability was equally distributed across
categories).

FIGURE 1 | Examples of stimuli used in the experiments presented
here (panels: Category 1, Category 2, Category 3). All stimuli were
created by randomly generating curvilinear segments that were then
added together. Each blob was composed of eight features
(each feature was a specific spatial position in the blob).

As a cover story, participants were told that a recent expedition 
to Mars had recovered several cells of alien organisms. Each cell 
could be categorized into one of three species solely on the basis 
of its perceptual features. Stimuli were presented on a computer 
screen, and participants responded by pressing one of three but- 
tons drawn on the screen, with an inconsistent mapping between 
location of the button and category label. 

Each category was composed of a total of 16 blobs. For each 
subject, eight blobs were randomly selected to be used during study 
while the remainder were used during test only. Each category was 
given a novel name, randomly selected for each participant from 
the following pool: "beme," "kipe," "vune," "coge," "zade," and
"tyfe" (Hendrickson et al., 2012).

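The per-participant randomization described above (eight of 16 exemplars for study, the rest for test only, and a randomly drawn novel name per category) might be sketched as follows. This is a minimal illustration under assumptions: the function name, the integer exemplar indices, and the choice to draw the six names without replacement are hypothetical and not taken from the original experiment code.

```python
import random

LABEL_POOL = ["beme", "kipe", "vune", "coge", "zade", "tyfe"]


def assign_materials(n_categories=6, n_exemplars=16, n_study=8, seed=None):
    """Randomly split each category's exemplars and give it a unique label."""
    rng = random.Random(seed)
    # Assumption: each of the six categories gets a different name from the pool.
    labels = rng.sample(LABEL_POOL, n_categories)
    materials = {}
    for category, label in enumerate(labels, start=1):
        exemplars = list(range(n_exemplars))
        rng.shuffle(exemplars)
        materials[category] = {
            "label": label,
            "study": sorted(exemplars[:n_study]),      # shown during study
            "test_only": sorted(exemplars[n_study:]),  # novel items at test
        }
    return materials


for category, info in assign_materials(seed=1).items():
    print(category, info["label"], info["study"], info["test_only"])
```

With three categories per study schedule, this split yields 24 studied and 24 novel exemplars per schedule, which matches the 48-trial transfer test described later in the Method.
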
Design and procedure 

This experiment had four conditions manipulated within- 
participants (schedule of study: interleaved vs. blocked study; 
and time of test: immediate vs. 24-h delayed) and two condi- 
tions manipulated between-participants (type of category: high 
similarity vs. low similarity categories). Participants started by 
completing one of the study conditions and the corresponding 
immediate transfer test and then completed the next study condi- 
tion and the immediate test for that condition during their initial 
visit to the lab (order of conditions was counterbalanced across 
participants). Participants returned to the lab approximately 24 h 
after finishing the second transfer test for a follow-up session. 

Study phase 

Each study phase was composed of four blocks of 48 trials each. 
Each trial started with a presentation of one stimulus in the cen- 
ter of the screen for 500 ms. After the blob was removed, the 
participant was asked to classify the blob they had just seen into 
one of three species by clicking the button on the screen with the 
correct species name. The label of each of the buttons was ran- 
domized on each trial so that the absolute position of a button 
on the screen could not be reliably associated with a category. 
Immediately after a response was recorded, the blob was pre- 
sented again in the center of the screen along with the correct 
category assignment and an indication as to whether the partici- 
pant's response was correct or incorrect. Feedback was presented 
for 2000 ms. A 1000 ms intertrial interval followed and then a new 
trial began. 
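
The fixed timing components of a study trial give a lower bound on the length of one study phase. The calculation below is a back-of-the-envelope sketch: it ignores response time, which was self-paced and is not reported here, so actual sessions would have been longer.

```python
STIMULUS_MS = 500     # stimulus presentation
FEEDBACK_MS = 2000    # feedback display
INTERTRIAL_MS = 1000  # intertrial interval
BLOCKS = 4
TRIALS_PER_BLOCK = 48

fixed_ms_per_trial = STIMULUS_MS + FEEDBACK_MS + INTERTRIAL_MS
n_trials = BLOCKS * TRIALS_PER_BLOCK
lower_bound_minutes = n_trials * fixed_ms_per_trial / 60_000

print(fixed_ms_per_trial, n_trials, lower_bound_minutes)  # 3500 192 11.2
```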

The two schedules of study (blocked vs. interleaved) differed 
only in the frequency of category change during study and the cat- 
egory labels. In the blocked condition, the presented categories 
alternated 25% of the time, whereas in the interleaved condi- 
tion, they alternated 75% of the time. Thus, in the interleaved 
condition, the probability of a blob being followed by a blob of 
the same category was low, whereas for the blocked condition, 
this probability was high. We used this probabilistic approach 



rather than creating purely interleaved or blocked conditions in 
order to diminish the possibility that participants noticed the 
pattern of alternation in responses, which would affect catego- 
rization accuracy (see Carvalho and Goldstone, 2014a for analysis 
and discussion of these effects). 
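
One way to realize such a probabilistic schedule is a simple chain in which the category changes on each trial with a fixed probability (0.25 for blocked, 0.75 for interleaved). The sketch below is an assumption-laden illustration, not the original experiment code: the function names are hypothetical, the new category is drawn uniformly from the other two, and no attempt is made to balance how often each category or exemplar appears.

```python
import random


def category_sequence(n_trials, p_switch, categories=("A", "B", "C"), seed=None):
    """Generate category labels where the category changes with probability p_switch."""
    rng = random.Random(seed)
    current = rng.choice(categories)
    sequence = [current]
    for _ in range(n_trials - 1):
        if rng.random() < p_switch:
            # Switch: pick one of the other categories at random.
            current = rng.choice([c for c in categories if c != current])
        sequence.append(current)
    return sequence


def switch_rate(sequence):
    """Proportion of trial-to-trial transitions on which the category changes."""
    changes = sum(a != b for a, b in zip(sequence, sequence[1:]))
    return changes / (len(sequence) - 1)


blocked = category_sequence(48, p_switch=0.25, seed=0)
interleaved = category_sequence(48, p_switch=0.75, seed=0)
print("".join(blocked), round(switch_rate(blocked), 2))
print("".join(interleaved), round(switch_rate(interleaved), 2))
```

In a single 48-trial block the observed alternation rate will fluctuate around the nominal value; over many trials it converges to `p_switch`. For comparison, a fully random order over three equally frequent categories changes category about two thirds of the time.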

Immediate transfer test 

Immediately after each study phase participants completed a trans- 
fer task. This task was composed of a total of 48 trials. Half of the 
trials were old trials in which an exemplar that had been presented 
before was presented and the other half were new trials in which 
a novel exemplar was presented. The new stimuli were similar to 
the ones studied, with new instantiations of the unique features 
(i.e., the unique feature presented with different non-diagnostic 
features). A random sequence of categories was used, meaning that
the probability of successive items belonging to the same category 
was 33%. On each transfer trial the stimulus was presented in the 
center of the screen for 500 ms. Once the stimulus was removed 
from the screen, the participants had to categorize it into one of 
the species they had just studied by clicking one of the buttons 
on the screen. The label of each of the buttons was randomized 
on each trial. No feedback was provided during the immediate 
transfer test. 

Delayed transfer test 

On their second visit to the lab, participants started by completing 
a refresher training task. The refresher task was given because 
pilot results indicated that some participants had memory of the 
previous day's categorization task, but did not remember which 
label had been associated with each stimulus type¹. This refresher
task was composed of 24 training trials similar to the study task
trials from the previous day, using the same study schedule as in
the previous session. Immediately after the refresher training task, 
participants completed a transfer test similar to the immediate 
transfer set they had completed the day before, using the same set 
of stimuli and with no feedback provided. 

RESULTS AND DISCUSSION 

We begin by analyzing the data from the study phase. These results 
are depicted in Figure 2. First, we focus on performance over the 
four blocks of the initial study session. One initial question is 
whether there is an interaction between the type of category and 
study schedule. This interaction was not reliable (p > 0.05), thus 
study performance seems to be approximately equivalent for each 
study schedule across the two category structures. 

However, performance is overall better during blocked study
when compared to interleaved study, F(1,111) = 41.51, p < 0.0001,
$\eta_G^2$ = 0.09. This result parallels previous evidence (e.g., Shea and
Morgan, 1979; Carvalho and Goldstone, 2014a) showing a benefit
of blocked study during study. However, this study advantage does
not always transfer to an equivalent advantage of blocked study
during test. Blocked study presents a higher level of response
predictability, which might help explain why performance is better
during blocked presentation. Finally, low similarity categories are
also easier to learn than high similarity ones, resulting in overall
better performance, F(1,111) = 6.83, p = 0.01, $\eta_G^2$ = 0.03.

¹We opted to include a refresher to remind participants of the mapping between
blob groupings and category labels because we are interested in how well participants
learned the groupings, i.e., category structure, and not whether the mapping
between the learned structure and the category label is also maintained. Memorizing
novel names when learning groupings of novel stimuli is a demanding task (Ashby
and O'Brien, 2005) and learners' ability to categorize new items is more influenced
by changes in how the objects are grouped than the labels used (i.e., changing the
labels for group A and B is less detrimental than mixing the items from group A and
B into new categories, e.g., Hendrickson et al., 2012).

FIGURE 2 | Performance during the study session and the
refresher component for Experiment 1A. The left panel shows
performance for interleaved and blocked study of high similarity
categories. The right panel shows performance for interleaved
and blocked study for low similarity categories. Error bars
indicate standard errors of the means. Chance-level performance
in this task was 0.33. The vertical dashed line represents
session break.

Notwithstanding these differences, we see an improvement in 
the ability to categorize the blobs across the study phase for all 
conditions, F(3,333) = 249.12, p < 0.0001, $\eta_G^2$ = 0.25. However,
this improvement is greater for low similarity categories compared
to high similarity categories, F(3,333) = 10.16, p < 0.0001,
$\eta_G^2$ = 0.01 and for interleaved study compared to blocked study,
F(3,333) = 10.30, p < 0.0001, $\eta_G^2$ = 0.01. These results are also
similar to previous evidence comparing interleaved and blocked 
study. 

Finally, we compared performance in the last block of study
on day 1 with performance on the refresher of day 2. Overall performance
was lower on the second day refresher than on the last
block of the study session on day 1, t(112) = 3.58, p < 0.001. This
effect seems to be mostly driven by the results in the low similarity 
category structure (see Figure 2). This slight decrease in perfor- 
mance is expected given the time interval between the last study 
block and the refresher. No effects of schedule of study, similarity 
structure of the categories or interaction between the two variables 
were found for the refresher session (all ps > 0.05). 

We now turn our attention to the results during test for both 
novel and studied items. The main results are depicted in Figure 3. 
As a reminder, there were two test sessions: one that took place 
immediately after the corresponding study session and another 
that took place 24 h later. Analyses of these data revealed a main
effect of study schedule, with overall better performance for interleaved
study than blocked study, F(1,111) = 14.99, p < 0.001,
$\eta_G^2$ = 0.04. However, this effect is qualified by a series of relevant
interactions. 



The three questions of interest relative to performance in the
test phase are (1) whether there is an interaction between type of
category used and the study schedule used, (2) whether there is an
overall improvement in performance following interleaved study
between immediate test and the 24 h delayed test, and (3) whether
the interaction pattern seen in immediate transfer tests changes
when transfer is tested 24 h later. As can be seen from Figure 3,
there is an interaction between the type of category used and the
schedule of study, F(1,111) = 4.26, p = 0.04, $\eta_G^2$ = 0.009. This
interaction shows that while interleaved study results in the best
transfer performance for high similarity categories, this advantage
is considerably reduced for low similarity categories. In the case
of low similarity categories, neither schedule of presentation seems
to result in overall better performance. Moreover, this interaction
does not change with transfer test time, i.e., it remains the same
24 h after study. However, a statistically reliable three-way interaction
between category type, schedule of study and test session,
F(1,111) = 5.32, p = 0.02, $\eta_G^2$ = 0.0007, seems to indicate that for
low similarity categories interleaved performance is better for old
items in the immediate transfer test when compared to blocked
study for the same type of items but this difference disappears for
the 24-h delayed test session.

Finally, performance is overall better for old stimuli compared
to new ones, F(1,111) = 135.90, p < 0.0001, $\eta_G^2$ = 0.04.
This result is indicative that, at least in part, participants may be
memorizing individual exemplars during study. Interestingly, the
difference in performance between new and old stimuli is greater
for low similarity categories compared to high similarity categories,
F(1,111) = 75.87, p < 0.0001, $\eta_G^2$ = 0.02. These results
suggest that, given the greater number of discrimination points 
between individual stimuli in the low similarity categories, par- 
ticipants are more likely to have better differentiated individual 
memories for the low similarity stimuli. 

Overall the results from this experiment show an interaction
between the schedule of study and the type of category on test
performance, which remains unaltered with increases in retention
interval. Moreover, there is no overall increase in the benefits of
interleaved study with increased retention intervals. Performance
during the transfer tests also does not seem to be the result of the
differential difficulties found during study. There is an interaction
between type of category and schedule of study at test, which is
not seen during study.

FIGURE 3 | Performance in the transfer tests of Experiment 1A. The left
panel depicts results for items studied during the study phase while the right
panel presents results for items not seen during the study phase. Results for
the high similarity categories are presented in red while results for the low
similarity categories are presented in blue. For each of these, the dashed lines
represent blocked study while the solid lines represent interleaved study.
Error bars indicate standard errors of the means. Chance-level performance in
this task was 0.33 and is represented in the graphs by the black dashed line.

EXPERIMENT 1B

We designed Experiment 1B to investigate the possibility that
the findings in Experiment 1A for the delayed tests are in part
the result of the existence of a refresher section immediately
before those tests and not the learning that took place on the
previous day. In this experiment a new group of participants
completed only the second day session of Experiment 1A. If
the refresher presented during this session were sufficient for
participants to learn the categories, then we should see similar
results here to what was found for the delayed test of Experiment
1A. On the contrary, if the brief refresher section is not
enough for participants to effectively learn the categories, we
would expect a qualitative decline in performance compared to
Experiment 1A, as well as no performance differences between the
two schedules of study and no interaction between schedule of
study and category type during test, contrary to what is seen for
Experiment 1A.

METHOD 
Participants 

A total of 63 Indiana University undergraduate students, who 
had not participated in the previous experiment, volunteered 
to participate in this study in exchange for partial course credit. 
Participants were randomly assigned to either the low similarity 
(N = 26) or high similarity (N = 37) conditions. No exclusion
criteria were used to match inclusion criteria in the second day of
Experiment 1A.



Apparatus and stimuli 

The same set of stimuli as in Experiment 1A was used in this
experiment. 

Design and procedure 

This experiment had two conditions manipulated within-participants
(interleaved vs. blocked study), and two conditions manipulated
between-participants (high similarity vs. low similarity
categories). Participants completed a task similar to the
second session of Experiment 1A. Participants started by completing
a short study task (the refresher task in Experiment 1A) for
three of the categories followed by an immediate test for those
categories and then repeated these steps for the second group of
three categories. Half the participants started with interleaved
study of the categories and the other half with blocked study of
the categories. All other details not presented here were the same
as in Experiment 1A.
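
To make the contrast between the two study schedules concrete, the
following is a minimal sketch that builds a blocked and an interleaved
sequence for three categories and computes the average lag between
successive presentations of the same category. The category labels,
the number of presentations per category, and the strict
alternation rule are illustrative assumptions; the exact sequencing
rule used in the experiments is not described in this section.

```python
from statistics import mean

def blocked_schedule(categories, reps):
    """Present every repetition of one category before moving to the next."""
    return [c for c in categories for _ in range(reps)]

def interleaved_schedule(categories, reps):
    """Cycle through the categories so that successive trials always differ."""
    return [c for _ in range(reps) for c in categories]

def mean_within_category_lag(schedule):
    """Average number of trials between successive presentations of a category."""
    last_seen, lags = {}, []
    for position, category in enumerate(schedule):
        if category in last_seen:
            lags.append(position - last_seen[category])
        last_seen[category] = position
    return mean(lags)

categories = ["A", "B", "C"]  # three categories per study block, as in the text
reps = 4                      # hypothetical number of presentations per category

blocked = blocked_schedule(categories, reps)
interleaved = interleaved_schedule(categories, reps)
print(blocked, mean_within_category_lag(blocked))
print(interleaved, mean_within_category_lag(interleaved))
```

Under these assumptions the interleaved sequence separates repetitions
of a category by more trials than the blocked sequence does, which is
the temporal-spacing difference that the delayed test was designed to
probe.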

RESULTS AND DISCUSSION 

The results for the study phase of Experiment 1B are depicted
in Figure 4 (in which the results of the refresher phase of
Experiment 1A are also depicted for comparison). As can be
seen in Figure 4, performance is qualitatively worse
during study in Experiment 1B compared to performance in
the refresher phase of Experiment 1A. Moreover, performance
in Experiment 1B is overall better during blocked study when
compared to interleaved study, F(1,61) = 58.84, p < 0.0001,
η²G = 0.33. The main effect of category structure and the
interaction between the two variables were not reliable (both
ps > 0.05). We also compared performance with the chance level
of 33% for each condition and type of category combination.
Performance was reliably above chance only in the case of the
blocked study condition, t(25) = 5.22, p < 0.0001 for low
similarity categories and t(36) = 9.66, p < 0.0001 for high
similarity categories.
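
The comparison against chance can be sketched as a one-sample t-test,
in which the degrees of freedom equal the number of participants minus
one (25 and 36 for groups of 26 and 37). The accuracies below are
hypothetical values invented for illustration, not data from the
experiment, and the helper name `one_sample_t` is an assumption of this
sketch.

```python
import math
from statistics import mean, stdev

CHANCE = 1 / 3  # three response options, so chance accuracy is about 0.33

def one_sample_t(scores, mu):
    """Return (t, df) for a one-sample t-test of `scores` against `mu`."""
    n = len(scores)
    standard_error = stdev(scores) / math.sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(scores) - mu) / standard_error, n - 1

# Hypothetical per-participant accuracies, NOT data from the experiment.
blocked_accuracy = [0.50, 0.42, 0.58, 0.33, 0.67, 0.50, 0.42, 0.58]

t_value, df = one_sample_t(blocked_accuracy, CHANCE)
print(f"t({df}) = {t_value:.2f}")
```

A positive t with a small p-value (obtained from the t distribution
with `df` degrees of freedom) would indicate accuracy reliably above
chance.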

FIGURE 4 | Performance during the Study Phase of Experiment 1B for
high and low similarity categories (left panel). The right panel presents
data from the second day refresher only, presented in Figure 2, and is
depicted here for comparison purposes. Solid lines indicate interleaved
study, while dashed lines indicate blocked study. Error bars indicate
standard errors of the means. Chance-level performance in this task was
0.33 and is represented in the graphs by the horizontal dashed line.



Turning now to performance during the immediate transfer
test, the results indicate that overall participants' performance
following only the refresher task is considerably worse than in
the delayed test of Experiment 1A and close to chance. The results
of the transfer task are presented in the left panel of Figure 5
along with the results from the delayed transfer test of Experiment
1A (right panel) for comparison. A mixed ANOVA with type of
item (new vs. old) and study schedule (interleaved vs. blocked)
as within-subject factors and category structure as a
between-subject factor for the results of Experiment 1B only showed
an effect of category structure, F(1,61) = 6.67, p = 0.01, η²G = 0.04,
with better performance for low similarity categories, and type of
item, F(1,61) = 5.55, p = 0.02, η²G = 0.01, with better
performance for old items. Moreover, the interaction between these
two variables was also reliable, F(1,61) = 5.15, p = 0.03,
η²G = 0.009. It was only when items were both old and had low
similarity that categorization accuracy was appreciably above
chance. When only one of these factor levels was present, accuracy
was close to chance. No other main effect or interaction was
statistically reliable (all Fs < 0). Overall performance is only
slightly above chance, considerably worse than what is seen in the
second day of Experiment 1A, and no effect of study schedule, or
interaction between study schedule and category structure, was
found. This indicates that the results found
for the delayed transfer test of Experiment 1A are unlikely to
be the result of the short refresher study session but rather are
the result of the extensive learning phase that took place 24 h
earlier.

FIGURE 5 | Performance in the transfer task of Experiment 1B (left
panel). The right panel depicts results for the delayed transfer of Experiment
1A for comparison purposes. Results for the high similarity categories are
presented in red while results for the low similarity categories are presented
in blue. For each of these, the dashed lines represent blocked study while the
solid lines represent interleaved study. Error bars indicate standard errors of
the means. Chance-level performance in this task was 0.33 and is
represented in the graphs by the black dashed line.

GENERAL DISCUSSION

Taken together, the results presented here suggest that (a) the
advantage of increased temporal lag between repetitions of the
same category is not being masked by the use of an immediate
test with low similarity categories: there was no difference
between immediate generalization and 24-h delayed generalization
for any of the category structures. Similarly, (b) there was no
overall increase in interleaved study benefits with an increase in
retention interval, unlike previous evidence with verbatim
repetitions. In addition, (c) different study sequences change the
relative emphasis on different properties of the category items as
seen by the relative learning benefit of each schedule, measured by
generalization to novel items.

As we mentioned in the Section "Introduction," in the context
of verbatim repetitions, greater temporal delays between
repetitions improve memory, particularly when the relative
difference between the temporal lag during study and the temporal
lag between study and test is increased (e.g., Crowder, 1976;
Glenberg, 1976; Murray, 1983). However, in the current
experiments we did not see such an effect of increased retention
interval, which questions the importance of temporal spacing
during study for the benefits of interleaved study in category
learning. This finding is in agreement with recent results by
Kang and Pashler (2012), and Birnbaum et al. (2013) showing
that introducing an additional temporal delay between
presentations during interleaved study results in memory performance
similar to that of a blocked study condition (i.e., decreases the
interleaved advantage). One possibility is that 24 h is too short
and with longer retention intervals an advantage of longer
temporal spacing between categories would be seen. Though this
remains an open question for future research and the exact
forgetting function for this type of stimuli is unknown, we believe
that it is unlikely that longer delays would yield interleaved
study benefits since 24 h has been demonstrated to be
sufficient before (Ste-Marie et al., 2004) and previous studies show
noticeable increases in the benefits of short temporal spacing
during study with 24-h retention intervals (see Cepeda et al.,
2008).

However, even though temporal spacing by itself might not 
play a fundamental role in the interleaved advantage seen thus far, 
the importance of the temporal delay between repetitions dur- 
ing study should not be ignored. For instance, Vlach et al. (2012) 
showed that introducing a temporal delay between different exem- 
plars of the same category resulted in improved performance in 
a 15 min delayed generalization test. The authors taught
2-year-old children eight different categories organized around shape,
each containing four similar exemplars varying in other prop- 
erties (color, texture, and size). Different groups of children 
learned the categories either by studying all the exemplars simul- 
taneously, individually blocked by category, or spaced (similar 
to the blocked condition but a play time was introduced after 
each naming trial). Children were tested (1) immediately after 
learning each category (i.e., after learning the first category a 
test session for that category would take place, prior to teach- 
ing the next category), and (2) 15 min later. For immediate 
tests, simultaneous presentation resulted in better generalization 
performance. Interestingly, 15 min later, only children in the 
spaced condition were able to generalize the categories learned 
above chance level. In fact, performance in the spaced condi- 
tion did not seem to diminish from the first to second test, 
while it decreased considerably for both blocked and simultaneous 
presentations. 

Birnbaum et al. (2013) found similar results with college
students using natural categories. In one experiment the authors
contrasted blocked and interleaved study when implemented
contiguously with another condition in which a temporal
delay was introduced between repetitions either of the same
category (blocked + spaced) or different categories
(interleaved + spaced). While interleaved + spaced resulted in worse
performance than interleaved (for similar results see Kang and
Pashler, 2012), the opposite pattern was seen for the blocked
study conditions, i.e., blocked + spaced resulted in better
performance than blocked study. This evidence across
development and stimuli makes it apparent that forgetting and
retrieval of information during study might play a role in
learning differences seen with different sequencing schedules
during study. As we mentioned in the Section "Introduction,"
participants might engage in a process of interactive recall
in which features of the previous encounter with that
category are recalled when a new item of the same category is
presented.

Overall, the present results are in agreement with the attentional 
bias hypothesis proposed by Carvalho and Goldstone (2014a,b) 
that predicts that the benefits of interleaved vs. blocked study are 
the result of an attentional biasing process taking place during 
the study phase. The attentional bias hypothesis proposes that 
during inductive category learning, learners tend to establish rela- 
tions between the current example being studied and the previous 
one. If the two objects belong to the same category, the learner's 
attention will be focused on similarities. If, conversely, the two 
stimuli belong to different categories, the learner's attention will 
be focused on the differences between the two objects. In this 
way, across time, attention will be increasingly biased towards 
relevant within-category similarities and between-category dif- 
ferences. This will affect category representation, which will, in 
turn, affect category encoding and recollection. With each new
trial, categorization-relevant properties will be progressively
better encoded while irrelevant ones will be poorly or not encoded
at all. Thus, blocked study emphasizes mostly similarities within
categories, benefiting the acquisition of low similarity categories, 
while learning high similarity categories will be improved by 
attending to differences between categories during interleaved 
study. 
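
The comparison process described above can be illustrated with a toy
simulation. This is only a sketch under strong assumptions and not the
authors' model: the binary feature coding, the additive update rule,
the `rate` parameter, and the example items are all hypothetical. After
each trial, attention shifts toward features on which the current and
previous items match (same category) or differ (different categories).

```python
def update_attention(weights, previous, current, rate=0.1):
    """Shift attention after comparing two successive items.

    Same category: boost features on which the items match (similarities).
    Different categories: boost features on which they differ (differences).
    """
    prev_features, prev_category = previous
    curr_features, curr_category = current
    same_category = prev_category == curr_category
    boosted = [
        w + rate if (a == b) == same_category else w
        for w, a, b in zip(weights, prev_features, curr_features)
    ]
    total = sum(boosted)
    return [w / total for w in boosted]  # renormalize so weights sum to 1

def run_schedule(items, n_features):
    """Start from equal attention and apply the update across a study sequence."""
    weights = [1 / n_features] * n_features
    for previous, current in zip(items, items[1:]):
        weights = update_attention(weights, previous, current)
    return weights

# Hypothetical items: feature 0 separates the categories, features 1-2 do not.
a1, a2 = ((0, 0, 1), "A"), ((0, 1, 0), "A")
b1, b2 = ((1, 0, 0), "B"), ((1, 1, 1), "B")

blocked_weights = run_schedule([a1, a2, b1, b2], n_features=3)
interleaved_weights = run_schedule([a1, b1, a2, b2], n_features=3)
print(blocked_weights)
print(interleaved_weights)
```

In this toy case both sequences end with the most attention on the
category-relevant feature, but they arrive there through different
comparisons: mostly within-category matches in the blocked sequence and
only between-category differences in the interleaved one.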

In this work we used novel, lab-generated category stimuli
presented briefly on the screen. While this type of stimulus and 
procedure matches current research in the concept learning liter- 
ature, it may limit generalization. It is possible that using natural 
categories not defined by a rule, in which the stimuli are presented 
for a longer period of time or participants do not have to guess the 
category assignment during study, might provide different results 
(but see Carvalho and Goldstone, 2011, 2014b; Kang and Pashler,
2012; Birnbaum et al., 2013; Rawson et al., 2014). Moreover,
it is possible that the Refresher included before the 24 h-delayed 
test interacted with the type of study during the initial training 
session influencing the results seen for the delayed test. While this 
hypothesis cannot be ruled out by the present results, given that 
the same schedule of study was used during the Refresher as during 
the initial study session and the Refresher by itself did not allow 
participants to learn the category structures (Experiment 1B), we
believe the possible influence of the Refresher is minimized. In 
addition, the inclusion of a Refresher might present added educa- 
tional validity to the results presented here. Students often review 
the concepts immediately before the examination, regardless of 
when the initial study took place. 

As a theoretical framework, one possible way to integrate the 
benefits of temporal spacing and the benefits of sequential com- 
parisons is by hypothesizing that they result from different learning 
processes, happening simultaneously during category acquisition.
Successfully learning new categories can be achieved by encoding 
the relevant features and rules or by encoding individual exemplars 
that will be compared to novel instances for novel categorizations 
in the future. Within exemplar models of category learning, both 
of these alternatives would depend on whether one feature (or 
set of features) was selectively attended during study, or all fea- 
tures were equally weighted (Nosofsky et al., 1989; Nosofsky, 1991; 
Kruschke, 1992). 
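
The role of attention weights in an exemplar model can be sketched
with a similarity rule in the style of the generalized context model:
similarity decays exponentially with attention-weighted distance, and
a category's evidence is the summed similarity to its stored
exemplars. The stored exemplars, the probe, the city-block metric, and
the sensitivity value `c` are illustrative assumptions, and response
bias is omitted.

```python
import math

def similarity(x, y, weights, c=2.0):
    """Exponential-decay similarity over attention-weighted city-block distance."""
    distance = sum(w * abs(a - b) for w, a, b in zip(weights, x, y))
    return math.exp(-c * distance)

def category_probabilities(probe, exemplars, weights, c=2.0):
    """Summed similarity to each category's exemplars, normalized to probabilities."""
    evidence = {}
    for features, label in exemplars:
        evidence[label] = evidence.get(label, 0.0) + similarity(probe, features, weights, c)
    total = sum(evidence.values())
    return {label: value / total for label, value in evidence.items()}

# Hypothetical stored exemplars: feature 0 is the category-relevant feature.
exemplars = [((0, 0, 1), "A"), ((0, 1, 0), "A"), ((1, 0, 0), "B"), ((1, 1, 1), "B")]
probe = (0, 1, 1)  # novel item that shares feature 0 with category A

selective = category_probabilities(probe, exemplars, weights=[1.0, 0.0, 0.0])
equal = category_probabilities(probe, exemplars, weights=[1 / 3, 1 / 3, 1 / 3])
print(selective)
print(equal)
```

With all attention on the relevant feature the model behaves like a
rule on that feature, whereas equal weights make classification depend
on overall similarity to whole stored exemplars.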

At a first pass, learners might try to identify and isolate the 
relevant properties of the stimuli for categorization. The relevant 
properties are the similarities within categories for low-similarity
categories and differences between categories for high-similarity 
categories. Identifying these properties will be promoted by spe- 
cific sequential comparisons as discussed before. If learners are 
successful, these relevant parts will receive greater attentional 
resources and be more efficiently encoded. Participants can then 
look for those when categorizing novel stimuli during a subse- 
quent transfer task. However, under some situations (blocked 
study of high similarity categories and interleaved study of low 
similarity categories), the relevant properties do not receive as 
much attention. This might lead participants to encode more 
features of each individual exemplar - a prediction derived 
from exemplar models assuming equally distributed attentional 
weights to all the features. This encoding would be improved 
by adding temporal spacing between presentations, which will 
result in increased effort in retrieving previous encounters dur- 
ing the recursive retrieval process and thus a better encoding 
of each stimulus (Bjork and Allen, 1970; Cuddy and Jacoby, 
1982; Krug et al., 1990). These exemplar memories of each
stimulus can then be used to categorize new stimuli during 
transfer. 

Consistent with this proposal, in Experiment 1A as well as in
previous work (Carvalho and Goldstone, 2014a), when low
similarity categories were used, memory for old items was better
following interleaved study than blocked study. This, although
not definitive, is indicative that, when abstracting the relevant fea- 
ture during study is not possible, learners might encode the entire 
stimulus, benefiting from manipulations that increase memory 
for individual stimuli. Perhaps a critical difference between these 
two processes is whether category abstraction is possible during 
study, which allows for encoding only the relevant features, or 
takes place only during test. This might be analogous to the results 
demonstrating differential exemplar memory for items that fit an 
abstracted categorization rule and those which do not (Palmeri 
and Nosofsky, 1995; Blair and Homa, 2003; Sakamoto and Love, 
2004). 

An important avenue for future work would be to systematically
contrast memory and generalization for different category
structures by increasing and decreasing temporal spacing between 
successive presentations. One prediction deriving from the pro- 
posal presented here would be that memory for the relevant 
feature encoded during study would be better for blocked study 
of low similarity categories and interleaved study of high sim- 
ilarity categories. Conversely, memory for the whole exemplars 
would be better for interleaved study of low similarity categories 
and blocked study of high similarity categories. Additionally,
increasing the temporal spacing would have a positive effect
on individual memories of each stimulus studied but a
negative effect on memory for the abstracted category-relevant
feature.

Information is usually presented to us in a structured, ordered
way, and it is likely that this order will shape how and how
well we learn. In inductive category learning, the sequence of
category examples has the potential to change what is encoded 
(Elio and Anderson, 1984; Medin and Bettger, 1994). Different 
schedules promote different attentional biases due to different 
sequential ordering, and change how information is encoded and 
remembered due to different temporal spacing between category 
repetitions. The results presented here show that increasing the 
temporal delay between study and test does not change the dif- 
ferential benefits of interleaved over blocked study for different 
types of categories. However, we propose that even though these 
results are consistent with the idea that the spacing effect does not 
play a role in the interleaved advantage for our task, retrieval and 
forgetting during study are likely to play a role in study sequencing 
effects in category learning. We presented a conceptual framework 
that integrates the effects of temporal spacing between repetitions 
during study as well as exemplar contrast. 